Abstract

In this article, we describe the XML storage system used in 
the WebContent project. We begin by advocating the use of 
an XML database in order to store WebContent documents, 
and we present two different ways of storing and querying 
these documents: the use of a centralized XML database
and the use of a P2P XML database. 

1 Context 

Overview: The WebContent platform proposes a specific UML
schema to be used by all its services. Through a canonical
transformation, this schema can be converted into an XML
Schema. This is extremely useful, since Web Services
communicate with each other by exchanging XML documents. It
therefore seems straightforward to manage all the documents
inside the WebContent platform in XML format, which brings
advantages when storing and querying them. In this article,
we describe two ways of managing the storage and querying of
such documents: a centralized XML database and a distributed
(P2P) XML database. These storage-service modules conform to
the WebContent interface for storage. The reason for
choosing XML databases over a simple file storage format is
twofold: performance and expressivity of queries since, as
we will see, it is possible to express any sort of XQuery on
a WebContent document.

WebContent Storage Services Interface: The platform defines
an interface for a storage service and, consequently, one
for a query service that accesses the stored data. These
interfaces are generic. To illustrate their flexibility, we
have provided two implementations [6] [1]. The first one
provides storage and querying on top of existing single-site
(centralized) XML database servers, using an existing XML
query engine: MonetDB. Second, we have implemented a
resource store distributed over the network peers and,
similarly, a query service implemented jointly by all the
peers in the network. We stress that moving from one
implementation of the storage service to another is totally
transparent to the user, and similarly for the query service.
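
The transparency claim can be pictured with a minimal Python
sketch. All names here (StorageService, CentralizedStore,
PeerToPeerStore, store, fetch) are hypothetical and are not
taken from the WebContent code base; in-memory dictionaries
stand in for the XML database and for the peers.

```python
import zlib
from abc import ABC, abstractmethod


class StorageService(ABC):
    """Hypothetical stand-in for the WebContent storage interface."""

    @abstractmethod
    def store(self, resource_id, xml):
        """Persist one XML resource under the given identifier."""

    @abstractmethod
    def fetch(self, resource_id):
        """Return the XML text previously stored under the identifier."""


class CentralizedStore(StorageService):
    """Single-site store; a dict stands in for the XML database."""

    def __init__(self):
        self._documents = {}

    def store(self, resource_id, xml):
        self._documents[resource_id] = xml

    def fetch(self, resource_id):
        return self._documents[resource_id]


class PeerToPeerStore(StorageService):
    """Store spread over several peers; callers never see which peer is used."""

    def __init__(self, peer_count=4):
        self._peers = [{} for _ in range(peer_count)]

    def _peer_for(self, resource_id):
        # crc32 is stable across runs, unlike the built-in hash() on strings.
        slot = zlib.crc32(resource_id.encode("utf-8")) % len(self._peers)
        return self._peers[slot]

    def store(self, resource_id, xml):
        self._peer_for(resource_id)[resource_id] = xml

    def fetch(self, resource_id):
        return self._peer_for(resource_id)[resource_id]


def archive_and_reload(service, resource_id, xml):
    """Client code written once against the interface only."""
    service.store(resource_id, xml)
    return service.fetch(resource_id)
```

Because archive_and_reload only depends on the abstract
interface, either implementation can be passed to it without
any change on the caller's side.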

2 Storage services

Centralized Store: For storage on a single machine, we 
can use either MonetDB or MS SQL Server. In both cases, 
the WebContent documents are stored in their native XML 
format, and can be queried via XQuery. An issue with such 
queries is that they may return results of any (XML) type. 
Therefore, we have developed a specific WebContent query
interface that only allows queries returning WebContent
resources, which may be placed in the warehouse.
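
One way such a restriction could be enforced is to inspect
each query result before handing it back. The sketch below is
only an illustration under stated assumptions: the set of
root tags in RESOURCE_TAGS and the function only_resources
are hypothetical, and the source does not say how the actual
interface performs this check.

```python
import xml.etree.ElementTree as ET

# Assumption: a WebContent resource is serialized with one of these root tags.
RESOURCE_TAGS = {"document", "section", "paragraph"}


def only_resources(raw_results):
    """Parse the results, rejecting the batch if any item is not a resource."""
    parsed = []
    for text in raw_results:
        try:
            element = ET.fromstring(text)
        except ET.ParseError as exc:
            # Atomic values such as "42" are not XML elements at all.
            raise ValueError(f"not a well-formed XML element: {text!r}") from exc
        if element.tag not in RESOURCE_TAGS:
            raise ValueError(f"<{element.tag}> is not a WebContent resource")
        parsed.append(element)
    return parsed
```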

P2P storage service: The P2P storage service is implemented
jointly by several peers, so that the exact location of a
piece of data is transparent to the user. The P2P storage
service also supports indexing facilities. A DHT service is
implemented on top of a distributed hash table (or DHT, for
short [4]). The DHT, a piece of distributed software running
on all peers, provides the connectivity of the network. It
assigns unique identifiers to peers and allows them to
easily join and leave the network. Indexing is supported
using a distributed data structure based on the simple
abstraction of (key, value) pairs, with two services, namely
put(k,v) and get(k). Without delving into the details, the
DHT stores all values associated with a given key k on a
particular peer in charge of that key.
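
The put(k,v) and get(k) abstraction can be sketched as
follows. This is a toy, single-process model, not the code of
any DHT used in WebContent: the class ToyDHT, the helper
ring_position, and the choice of "first peer at or after the
key on a hash ring" as the peer in charge are assumptions
made for illustration.

```python
import bisect
import hashlib


def ring_position(name):
    """Hash a peer name or a key to an integer position on the identifier ring."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyDHT:
    def __init__(self):
        self._positions = []  # sorted ring positions of the live peers
        self._stores = {}     # ring position -> {key: [values]} held by that peer
        self._names = {}      # ring position -> peer name

    def _owner(self, key):
        """Ring position of the peer in charge of key (its successor on the ring)."""
        if not self._positions:
            raise RuntimeError("no peer has joined the DHT")
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._positions[index % len(self._positions)]

    def owner_name(self, key):
        return self._names[self._owner(key)]

    def join(self, peer_name):
        position = ring_position(peer_name)
        bisect.insort(self._positions, position)
        self._stores[position] = {}
        self._names[position] = peer_name
        self._redistribute()

    def leave(self, peer_name):
        # Raises RuntimeError if the last peer leaves while still holding data.
        position = ring_position(peer_name)
        orphaned = self._stores.pop(position)
        del self._names[position]
        self._positions.remove(position)
        for key, values in orphaned.items():
            for value in values:
                self.put(key, value)

    def _redistribute(self):
        """Re-place every entry so each key sits on its current owner."""
        entries = [(key, value)
                   for store in self._stores.values()
                   for key, values in store.items()
                   for value in values]
        for store in self._stores.values():
            store.clear()
        for key, value in entries:
            self.put(key, value)

    def put(self, key, value):
        self._stores[self._owner(key)].setdefault(key, []).append(value)

    def get(self, key):
        return list(self._stores[self._owner(key)].get(key, []))
```

All values published under one key end up on a single peer,
and they remain reachable through get(k) when peers join or
leave, which is the behaviour the paragraph above relies on.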

Different DHTs may have different algorithmic properties,
which are interesting from different performance viewpoints.
For instance, a DHT may guarantee that two keys k1 and k2
that are "close" by some distance measure are managed by
peers that are "close" in some sense. To take advantage of
the good properties of distinct DHTs, several DHTs may
coexist in a WebContent deployment architecture. Thus, a
peer p belonging to the DHTs dht1, dht2, ... is an endpoint
for the services join1, leave1, put1, get1, but also for
join2, leave2, put2, get2, etc. We have so far successfully
integrated two DHTs [1]: FreePastry [7] from MIT, including
our own extensions for robust scalable XML indexing [2]; and
PathFinder [5], specially tuned to support interval search
queries (which FreePastry does not support). The Active XML
engine is responsible for interacting with the available
DHTs, since their presence and the query processing
performed by each of them should be transparent to the user.
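
The idea of one peer serving as an endpoint for several DHTs
at once can be sketched as below. InMemoryDHT and Peer are
hypothetical names; a real deployment would wrap the actual
DHT libraries instead of the in-memory stand-in used here.

```python
class InMemoryDHT:
    """Stand-in for one DHT, reduced to membership plus a (key, values) table."""

    def __init__(self):
        self.members = set()
        self.table = {}

    def join(self, peer_id):
        self.members.add(peer_id)

    def leave(self, peer_id):
        self.members.discard(peer_id)

    def put(self, key, value):
        self.table.setdefault(key, []).append(value)

    def get(self, key):
        return list(self.table.get(key, []))


class Peer:
    """A peer exposing join_i, leave_i, put_i and get_i for each DHT i it belongs to."""

    def __init__(self, peer_id, dhts):
        self.peer_id = peer_id
        self.dhts = dict(dhts)  # DHT name -> DHT handle

    def join(self, dht_name):
        self.dhts[dht_name].join(self.peer_id)

    def leave(self, dht_name):
        self.dhts[dht_name].leave(self.peer_id)

    def put(self, dht_name, key, value):
        self.dhts[dht_name].put(key, value)

    def get(self, dht_name, key):
        return self.dhts[dht_name].get(key)
```

Each operation names the DHT it targets, so the indexes stay
independent: a key published through dht1 is not visible
through dht2 unless it is published there as well.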

3 Query services

XQuery: WebContent resource exploitation relies on advanced
query processing capabilities. To this end, we use XML query
services. In its centralized (one-site) implementation, an
XML query service takes as input an XQuery and returns its
results as evaluated by the underlying XML DB. Observe that
in this context, it is only meaningful to solicit the query
service on the machine that stores the queried document(s).
XQuery is an extremely powerful language, and it is possible
to write many complex queries, in particular to restructure
documents or perform joins on them. While the WebContent
interface allows such queries to be written, the main
functionality of the store is to provide access to any
resource that is stored in the database. Recall that a
resource can be anything from a document to one of its
atomic resources, such as a paragraph. Such lookups are much
simpler than general XQueries, and are implemented in the
centralized service using an index on all the resource
elements. It is therefore possible to find any resource
stored in the database in time O(1) (the serialization cost
is of course a function of the size of the resource).
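
A resource index of this kind can be sketched with a hash
map from resource identifier to element. The sketch assumes
that every resource element carries an id attribute, which is
an assumption made for illustration and not a statement about
the WebContent schema; the function names are hypothetical
too.

```python
import xml.etree.ElementTree as ET


def build_resource_index(document_xml):
    """Walk the document once and index every element that has an id attribute."""
    root = ET.fromstring(document_xml)
    return {element.get("id"): element
            for element in root.iter()
            if element.get("id") is not None}


def fetch_resource(index, resource_id):
    """Constant-time lookup on average, then serialization of the subtree found."""
    return ET.tostring(index[resource_id], encoding="unicode")
```

The dictionary lookup does not depend on the number of
stored resources, whereas ET.tostring has to visit the whole
subtree, which matches the remark on serialization cost.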

P2P Query engine: The implementation of our P2P XML query
service is more involved. This service is provided by any
WebContent peer, and is implemented by several collaborating
peers. The queries it supports may be asked against the set
of all documents available in the warehouse, regardless of
their location. The processing of such a query can be traced
in Figure 1. This figure shows a P2P WebContent network
based on two superposed DHTs, such as KadoP and PathFinder,
which we integrated. Accordingly, the detailed structure of
peer p1 features a tree pattern query processor for each of
the DHTs. Classical database optimization techniques can be
incorporated into each of these tree pattern query
processors; for example, a query cache has been built into
the KadoP tree pattern query processor.

Figure 1: Outline of P2P services.

The query is handed to a P2P optimizer service we developed,
namely OptimAX [3], which performs two tasks. (i) Based on
the knowledge it has of the available DHT indices, and with
the help of an embedded XQuery algebraic compiler [8],
OptimAX extracts from the query the maximal subqueries that
can be processed by each of the available tree pattern query
processors, and a recomposition query which assembles the
results of index lookups into the desired query result form.
(ii) The calls to the KadoP and/or PathFinder indexes are
placed in the network of peers in such a way as to reduce
the total amount of data transfers incurred by query
processing. OptimAX is implemented as a rule-based rewriting
engine, and execution plans are encoded as ActiveXML
documents. Once OptimAX has produced an execution plan, it
is given to the AXML engine for execution. This is carried
out by relying on the tree pattern query capabilities of
KadoP [2] and PathFinder [5], and on the XML query service
local to the query peer for the recomposition query.
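
The split between index lookups and recomposition can be
pictured with a deliberately simplified sketch. It does not
reproduce OptimAX's rule-based rewriting, its extraction of
maximal subqueries, or its placement of calls on peers. It
only routes tree patterns that are assumed to be already
extracted to a processor able to answer them, and recomposes
the answers by intersecting sets of document identifiers. The
function names, the capability flags and the use of
intersection as the recomposition step are all assumptions.

```python
def plan_query(patterns, processors):
    """Assign each tree pattern to the first processor able to answer it.

    patterns:   list of (path, needs_interval_search) pairs
    processors: dict mapping a processor name to True if it supports
                interval search, False otherwise
    """
    calls = []
    for path, needs_interval in patterns:
        for name, supports_interval in processors.items():
            if supports_interval or not needs_interval:
                calls.append((name, path))
                break
        else:
            raise ValueError(f"no index can answer {path!r}")
    return calls


def execute(calls, indexes):
    """Run the index lookups, then recompose by intersecting the document sets."""
    partial = [indexes[name].get(path, set()) for name, path in calls]
    return set.intersection(*partial) if partial else set()
```

In this toy setting, a query made of one plain pattern and
one interval pattern would result in one call per index, and
only the documents returned by both lookups would survive the
recomposition step.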

Finally, each peer is capable of processing semantic queries
over RDF data, expressed in a conjunctive subset of the
SPARQL language.